The final exam consists of 10 programming exercises, each representing one module from the course. In order to receive a base-level pass, you must successfully complete 5 of the 10 exercises; in order to receive an advanced-level pass, you must successfully complete 8 of the 10 exercises.
For each exercise, submit your R code as well as any additional output requested by the problem. Your final file(s) should be submitted via the dropbox on the course page (under the Capstone Project section of the site). If you have any questions during the exam, do not hesitate to ask the instructor.
By taking this exam, you assert that you neither gave nor received unauthorized aid during the course of the exam.
The ability to easily store and access data is changing the way that cities conduct government business. Local governments collect data on the services they oversee, such as traffic stops and utility usage. Many cities across the world have decided to make their data open to the public. Bloomington, Indiana is one such city (https://data.bloomington.in.gov/).
When citizens have non-emergency requests for city services (a trash can was not picked up, a street light is out), they contact city government, typically by dialing “311.” Open311 is a protocol for collecting and reporting data in this category. All service calls made in 2015 have been made available by Bloomington, and we will work with this data throughout the exam. The datasets needed to complete the final exam are available on the course page (FinalData.RData). Assuming the RData file has been placed in the same directory as your R code, you can load these datasets using the following command:
```r
load("FinalData.RData")
```
This loads the following elements:
- open311.df: A dataframe containing various characteristics about each service request. It contains the following variables:
  - ID: A unique identifier for the service call.
  - Status: Indicator of whether the call had been resolved at the end of the year.
  - Service_Name: The name of the service performed.
  - Service_Code: A code for determining the type of service performed.
  - Agency_Responsible: The agency responsible for handling the call.
  - Date_Requested: A datetime object (see lubridate package for details) indicating the date and time the service call was initiated.
  - Date_Updated: A datetime object indicating the date and time the call was last addressed. If the call has been resolved, this is the date it was resolved.
  - Description: Text description of the service call.
  - Latitude: Latitude (in degrees) of the location where the service should take place.
  - Longitude: Longitude (in degrees) of the location where the service should take place.
  - Distance: Distance (in miles) of the service location from the city center.
  - Weekday_Requested: The day of the week the request came in.
  - Weekday_Updated: The day of the week the request was updated.
  - Service_Time: The time, in minutes, between when the service was requested and the last update to the record.

You will also need the following file, which is on the course site.

- open311.xml: XML file downloaded from the following website; this is the original form of the data.

Consider the following two questions:
To address these questions, construct the following statistics for each day of the week:
Specifically, construct code which results in the following summary:
| Weekday Requested | Number of Calls | Percent Parks and Rec |
|---|---|---|
| Sunday | 186 | 2.1505376 |
| Monday | 1792 | 0.2232143 |
| Tuesday | 1750 | 0.4000000 |
| Wednesday | 1237 | 0.2425222 |
| Thursday | 1026 | 0.1949318 |
| Friday | 419 | 0.7159905 |
| Saturday | 135 | 0.0000000 |
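A summary of this shape could be built roughly as follows. This is only a sketch on a toy stand-in for open311.df: it assumes the dplyr package is available, that the percentage is computed from Agency_Responsible, and that the agency is labeled "Parks and Recreation" (the actual label in the data may differ).

```r
library(dplyr)

# Toy stand-in for open311.df; the agency labels here are assumptions.
toy <- data.frame(
  Weekday_Requested = factor(
    c("Sunday", "Monday", "Monday", "Monday", "Tuesday", "Tuesday"),
    levels = c("Sunday", "Monday", "Tuesday")
  ),
  Agency_Responsible = c("Parks and Recreation", "Sanitation", "Street",
                         "Parks and Recreation", "Sanitation", "Street")
)

weekday_summary <- toy %>%
  group_by(Weekday_Requested) %>%
  summarise(
    # Number of requests received on each weekday
    `Number of Calls` = n(),
    # Share of that day's requests handled by the (assumed) Parks label
    `Percent Parks and Rec` =
      100 * mean(Agency_Responsible == "Parks and Recreation")
  )

weekday_summary
```

Storing the weekday as a factor with its levels in calendar order keeps the rows in the order shown in the table rather than alphabetical order.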
One of the variables in open311.df is Distance, the distance of the request location from the city center. This was computed using the “haversine” formula, which gives the great-circle distance between two locations: the shortest distance over the earth’s surface, “as the crow flies.” The haversine distance, \(d\), is given by the following formula:
\[ \begin{aligned} x &= \sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \left[\cos\left(\phi_1\right) \cdot \cos\left(\phi_2\right) \cdot \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)\right] \\ y &= 2 \cdot \text{atan}_2\left(\sqrt{x}, \sqrt{(1-x)}\right) \\ d &= R \cdot y \end{aligned} \]
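where

- \(\phi_1\) and \(\phi_2\) are the latitudes of the two locations (in radians),
- \(\lambda_1\) and \(\lambda_2\) are the longitudes of the two locations (in radians),
- \(R\) is the radius of the earth, and
- \(\text{atan}_2\) is the two-argument arctangent function (in R, atan2).

Write a function haversine() which returns the haversine distance and takes the following parameters:

- lat1: the latitude (in degrees) of the first location.
- long1: the longitude (in degrees) of the first location.
- lat2: the latitude (in degrees) of the second location.
- long2: the longitude (in degrees) of the second location.
- miles: a boolean; if TRUE (default), the distance in miles is returned; if FALSE, the distance in kilometers is returned.

Note that the function should be vectorized over the parameters lat1, long1, lat2, long2; that is, these should accept vectors as arguments.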
The latitude and longitude (in degrees) of the city center of Bloomington, Indiana are (Latitude: 39.1653, Longitude: -86.5264). You can test your function by recreating the Distance column of open311.df.
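One possible shape for such a function is sketched below. The earth radius values (3958.8 miles, 6371 km) are commonly used mean radii and are an assumption here; the value used to build the Distance column may differ slightly, so small discrepancies are to be expected.

```r
haversine <- function(lat1, long1, lat2, long2, miles = TRUE) {
  # Assumed mean earth radius; the exam's own value of R is not stated.
  R <- if (miles) 3958.8 else 6371

  # sin() and cos() in R expect radians, so convert from degrees first.
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  lambda1 <- long1 * to_rad
  lambda2 <- long2 * to_rad

  x <- sin((phi2 - phi1) / 2)^2 +
    cos(phi1) * cos(phi2) * sin((lambda2 - lambda1) / 2)^2
  # Guard against tiny floating-point excursions outside [0, 1].
  x <- pmin(1, pmax(0, x))

  y <- 2 * atan2(sqrt(x), sqrt(1 - x))
  R * y
}

# Vectorized call: distances from the city center to two example points
haversine(39.1653, -86.5264, c(39.2, 39.1), c(-86.5, -86.6))
```

Because every operation inside the function is elementwise, vector arguments are handled through R's usual recycling rules without an explicit loop.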
We would like to graphically examine the relationship between the time required to complete a service request and the distance of the request from the city center. In addition, we would like to determine whether this relationship tends to differ depending on the day the service was requested. This could be captured in the following graphic:
You are to recreate the above graphic. Note the following elements should be included:
When a call is taken, the operator generally provides a description of the request being made; in addition, the type of service requested is determined (there are 45 unique names given to service calls). For all calls received (regardless of whether the status is open or closed), determine the following:
We have been working with a fairly clean dataset. However, this is not how the data is available online. The file open311.xml is the original file downloaded from the website (see above for link). You are to use this file to construct a version of the open311.df dataset. In particular, construct a dataset MyOpen311.df (from the open311.xml file) which contains the following variables:
- ID: The unique request identifier.
- Status: The status of the request.
- Service_Name: The name of the service requested.

Your dataset, if constructed correctly, will match the records of open311.df.
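The extraction step might look something like the sketch below, which uses the xml2 package on a small inline document. The tag names (request, service_request_id, status, service_name) are assumptions based on the Open311 GeoReport format; the structure of open311.xml should be inspected before relying on them.

```r
library(xml2)

# Toy document standing in for open311.xml; tag names are assumed.
xml_txt <- '<service_requests>
  <request>
    <service_request_id>1001</service_request_id>
    <status>closed</status>
    <service_name>Street Lights</service_name>
  </request>
  <request>
    <service_request_id>1002</service_request_id>
    <status>open</status>
    <service_name>Potholes</service_name>
  </request>
</service_requests>'

doc <- read_xml(xml_txt)  # for the exam file: read_xml("open311.xml")
requests <- xml_find_all(doc, ".//request")

# Hypothetical helper: text of one child tag per request (NA if absent)
field <- function(nodes, tag) {
  xml_text(xml_find_first(nodes, paste0("./", tag)))
}

MyOpen311.df <- data.frame(
  ID = field(requests, "service_request_id"),
  Status = field(requests, "status"),
  Service_Name = field(requests, "service_name"),
  stringsAsFactors = FALSE
)
```

Using xml_find_first() on the whole node set returns one result per request, so records with a missing field stay aligned instead of shifting the column.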
Consider the following logical questions you might ask with this dataset:
To get at this, we consider an interactive graphic which allows us to examine where in Bloomington service calls are originating. You are to replicate the following graphic.
Note the following attributes of the graphic:
The “Street Department” received 140 requests regarding “Street Lights” over the course of the year. Suppose that these calls are representative of what the Street Department in Bloomington receives regarding street lights. In particular, assume that they can be used to represent the time between such calls for this department.
Further, suppose we are willing to make the following assumption: let \(X\) represent the cost of making a repair to a street light; assume \(X \sim N\left(1900, 300^2\right)\). That is, the cost varies according to a Normal distribution with a mean of $1900 and a standard deviation of $300.
Suppose the city has set aside $300,000 for repairs to street lights for the year. Will this be enough funds? Conduct a simulation to determine the following:
Your annual costs should have the following distribution (based on 10,000 simulations).
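One way such a simulation could be organized is sketched below. It is an interpretation, not the required approach: it assumes the observed times between calls are resampled with replacement until a 365-day year is filled, and it uses randomly generated gaps as a stand-in for the real inter-call times, so its numbers will not match the exam data.

```r
set.seed(311)

# Hypothetical stand-in for the observed days between street light calls;
# in practice these would be differences of the sorted Date_Requested values.
gaps <- rexp(139, rate = 139 / 365)

simulate_year <- function(gaps, mean_cost = 1900, sd_cost = 300, days = 365) {
  elapsed <- 0
  n_calls <- 0
  repeat {
    # Draw the waiting time to the next call from the empirical gaps
    elapsed <- elapsed + sample(gaps, 1)
    if (elapsed > days) break
    n_calls <- n_calls + 1
  }
  # Each call incurs an independent Normal(1900, 300^2) repair cost
  sum(rnorm(n_calls, mean = mean_cost, sd = sd_cost))
}

n_sims <- 10000
annual_costs <- replicate(n_sims, simulate_year(gaps))

# Estimated probability that the $300,000 budget is exceeded
prob_over_budget <- mean(annual_costs > 300000)
```

A histogram of annual_costs, for example via hist(annual_costs), then gives the simulated distribution of the yearly total.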
The director of the Sanitation Department would like to reference this data when making requests for funding. In particular, he would like to make a statement like “the Sanitation Department is \(\theta\) times more likely to receive a request for service than the Street Department.” He could estimate this value by considering the following:
\[\widehat{\theta} = \frac{\text{Number of Requests for Sanitation Department}}{\text{Number of Requests for Street Department}}\]
However, having had a course in statistics, he would like to present a 95% confidence interval for this estimate. Write a bootstrap algorithm to compute a 95% confidence interval for this estimate. You may assume that the calls for the year provided are a random sample of typical call volume/type.
Your interval should be roughly (based on 5000 replications):
```
    2.5%    97.5% 
5.011411 5.869312 
```
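A percentile bootstrap for this ratio could be sketched as follows. The agency vector is toy data with made-up labels and proportions, so the resulting interval will not match the one above; with the exam data, the resampled vector would be the Agency_Responsible column.

```r
set.seed(2015)

# Toy stand-in for open311.df$Agency_Responsible; labels are assumed.
agency <- sample(
  c("Sanitation", "Street", "Other"),
  size = 2000, replace = TRUE, prob = c(0.55, 0.10, 0.35)
)

# Ratio of Sanitation requests to Street requests
theta_hat <- function(a) sum(a == "Sanitation") / sum(a == "Street")

# Resample whole calls with replacement and recompute the ratio each time
n_boot <- 5000
boot_theta <- replicate(n_boot, theta_hat(sample(agency, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap estimates
ci <- quantile(boot_theta, c(0.025, 0.975))
ci
```

Resampling the calls themselves, rather than the two counts separately, keeps the numerator and denominator dependent in the same way they are in the original sample.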
We now use the few features in the open311.df dataset to build a machine learning model to predict the length of time (in minutes) to close a call. Write code which does the following:
- mtry parameter over a grid of integers between 1 and 6 inclusive, using 5-fold cross-validation.
- Service_Name, Agency_Responsible, Distance, Weekday_Requested.

Note: due to the complexity of the response, it can take a while for the model to finish running. The error metrics for your model on your test set should be similar to the following (the model is not very good):

```
     RMSE  Rsquared       MAE 
18726.059     0.206  7298.388 
```
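A tuning workflow of this kind could be sketched with the caret package as below. Several things here are assumptions: that caret (with the randomForest package behind method = "rf") is the intended tool, that an 80/20 train/test split is used, and the toy data itself, which only mimics the column names of open311.df.

```r
library(caret)  # method = "rf" also requires the randomForest package

set.seed(311)
n <- 300

# Toy stand-in for open311.df; values and labels are made up.
toy <- data.frame(
  Service_Name = factor(sample(c("Potholes", "Street Lights", "Trash"),
                               n, replace = TRUE)),
  Agency_Responsible = factor(sample(c("Street", "Sanitation"),
                                     n, replace = TRUE)),
  Distance = runif(n, 0, 5),
  Weekday_Requested = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri"),
                                    n, replace = TRUE))
)
toy$Service_Time <- 500 + 200 * toy$Distance + rnorm(n, sd = 100)

# Assumed 80/20 split into training and test sets
in_train <- createDataPartition(toy$Service_Time, p = 0.8, list = FALSE)
train_df <- toy[in_train, ]
test_df <- toy[-in_train, ]

# Tune mtry over 1..6 with 5-fold cross-validation
fit <- train(
  Service_Time ~ Service_Name + Agency_Responsible + Distance +
    Weekday_Requested,
  data = train_df,
  method = "rf",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(mtry = 1:6)
)

# RMSE, Rsquared and MAE on the held-out test set
test_metrics <- postResample(
  pred = predict(fit, newdata = test_df),
  obs = test_df$Service_Time
)
test_metrics
```

With the formula interface, factor predictors are expanded into indicator columns, which is why mtry values above the number of named predictors remain valid.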
Write your solutions to the exam using R Markdown and submit your code and any relevant output as an HTML page. A couple of things regarding your solutions: